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ABSTRACT 

This paper presents two efficient mapping algorithms for scheduling the robot inverse dynamics computation 
consisting of m computational modules with precedence relationship to be executed on a multiprocessor system con- 
sisting of p identical homogeneous processors with processor and communication costs to achieve minimum computa- 
tion time. An objective function is defined in terms of the sum of the processor finishing time and the interprocessor 
communication time. The minimax optimization is performed on the objective function to obtain the best mapping. 
This mapping problem can be formulated as a combination of the graph partitioning and the scheduling problems, 
both have been known to be AF-complete. Thus, to speed up the searching for a solution, two heuristic algorithms 
were proposed to obtain fast but suboptimal mapping solutions. The first algorithm utilizes the level and the commun- 
ication intensity of the task modules to construct an ordered priority list of ready modules and the module assignment 
is performed by a weighted bipartite matching algorithm. For a near-optimal mapping solution, the problem can be 
solved by the heuristic algorithm with simulated annealing. These proposed optimization algorithms can solve various 
large-scale problems within a reasonable time. Computer simulations were performed to evaluate and verify the per- 
formance and the validity of the proposed mapping algorithms. Finally, experiments for computing the inverse 
dynamics of a six-jointed PUMA-like manipulator based on the Newton-Euler dynamic equations were implemented 
on an NCUBE/ten hypercube computer to verify the proposed mapping algorithms. Computer simulation and experi- 
mental results are compared and discussed. 

1. Introduction 

Robot manipulators are highly nonlinear systems, and their motion control involves the computation of the 
required generalized forces/torques from an appropriate manipulator dynamics model. There are a number of ways to 
compute the generalized forces/torques among which the computation of joint torques from the Newton-Euler (NE) 
equations of motion [1,2] is the most efficient and has been shown to possess the time lower bound of 0(n ) running in 
uniprocessor computers [3], where n is the number of degrees-of-freedom of the manipulator. It is unlikely that 
further substantial improvements in computational efficiency can be achieved. Nevertheless, some improvements 
could be achieved by taking advantage of particular computation structures [4], customized algorithms/architectures 
for specific manipulators [5,6], parallel computations [3,7], and scheduling algorithms for multiprocessor systems [8- 

H]. 

This paper presents two efficient mapping algorithms for scheduling the robot inverse dynamics computation of 
an n-jointed manipulator to be executed on a multiprocessor system with processor finishing time and interprocessor 
communication time considered in the system. The NE equations of motion are decomposed into m computational 
modules which are scheduled to be executed on p identical homogeneous processors to achieve minimum computation 
time. Several approaches to the general mapping problem have been proposed [13-18], In this paper, an objective 
function is defined in terms of the sum of the processor finishing time and the interprocessor communication time. 
The minimax optimization of the objective function is performed to obtain the best mapping. This mapping problem 
is formulated as a combination of the graph partitioning and the scheduling problems; both have been known to be 
AT* -complete. Thus, we first propose an efficient heuristic algorithm to obtain a fast but suboptimal solution. The 
heuristic algorithm utilizes the level and the communication intensity of the task modules to construct an ordered 
priority list of ready modules, and the module assignment is performed by a weighted bipartite matching algorithm. 
For a near-optimal mapping solution, the problem can be solved by a simulated annealing method with an efficient 
lower bound which indicates the minimum-finishing-time of the special scheduling problem. The simulated annealing 
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algorithm is a statistical optimization method which can solve various large-scale combinatorial optimization prob- 
lems within a reasonable time. This approach transforms the mapping problem into a combined graph partitioning and 
scheduling problem. A partition of the task graph is first performed, and then the modules in each "block" of the parti- 
tion are scheduled to be executed by the processor assigned to that block of the partition. Computer simulations and 
experimental results for computing the inverse dynamics of a six-jointed PUMA manipulator on an NCUBE/ten 
hypercube computer are compared and discussed. 

2. Problem Formulation and Objective Function 

The problem of computing manipulator joint torques based on a manipulator dynamic model is often referred to 
as the inverse dynamics problem and can be stated as: Given the joint positions and velocities {qj(t ) , qj(t) }* =1 which 
describe the state of an n-jointcd manipulator at time t, together with the joint accelerations {^,(/)}"=i which are 
desired at that time, solve the dynamic equations of motion for the joint torques {t, (0}"=j as follow: 

t(/) = f (q(/), q(r), q(0) (1) 

where T(0 = (T,,T 2 . ••• ,t,) r , q(/) = (q u q 2 , ,q K f, q(t) = (?i, q 2 . ■" ,q nf, q(0 = («i.f 2 . ' ,q*) T > 

the superscript T denotes transpose operation on vectors and matrices, and Eq. (1) indicates the functional representa- 
tion of the manipulator dynamic model. Since the NE equations of motion have been known for their efficiency in 
computing the joint torques, our objective is to see how fast one can map the computation of the NE equations of 
motion on a multiprocessor system with p identical processors to achieve minimum computation time. 

2.1. Representation of System Model 

In general, a computational task can be represented by a directed acyclic task graph (DATG) G c = (V c , E c ) 
consisting of a finite nonempty set of vertices V c ,V c =[T k I T k e G c , k = 1 , 2 , • • • , m } , and a set of finite edges, 
E c = {eq 1 eij eC t ). connecting them. Each vertex represents a computational module (CM), and each edge 
represents the precedence constraint between two CMs. An edge connecting module T t to module 7} is denoted by 
e c ij. The precedence constraint between CMs indicates which modules have to be completed before some other 
modules can be started. The ordered pair (7) , £>,) t is introduced for labeling the module T t which means that module 
I/ requires D, units of execution time for completion. Similarly, a multiprocessor system can be defined by an 
undirected doubly weighted processor graph (UDPG) G p = (V p , E p ), where V p is a finite nonempty set of vertices 
denoting processors, \V p \=p*e>, and E p = {(e*,^) I e'.e' e G p ) is a set of finite double-weighted edges. 
( e fj’ e ij) is an ordered pair associated with the edge connecting processor i to processor j in E p , where e s tJ indicates the 
message sending time incurred on processor x, and e r tJ indicates the message receiving time incurred on processor j, if 
the necessary message communication is required to transfer from processor i to processor j. If we represent the send- 
ing time among the processors in the system in a matrix SM, then the (t , j) entry of this matrix, SM(i , j), indicates the 
sending time between the processors i and j. Similarly, the matrix RM can be used to indicate the receiving time 
among the processors, and the (t , j) entry of this matrix, RM(i , j), is the receiving time between processors i and j. 
In this paper, we assume that these two matrices are symmetric and that the diagonal elements of these matrices are 
zero to indicate the negligible message communication time within the same processor. Figure 1 shows a task graph, a 
processor graph, and their corresponding SM and RM matrices. 

For a given DATG, if there is an edge from module i to module j, then module i is said to be an immediate 
predecessor of module j, and we denote it as IPRED(j ) = i. If there is a directed path from module i to module j, then 
module i is said to be a predecessor of module j, and we denote it as PRED(J) = i. Initial modules are those modules 
with no predecessors, and terminal modules are those modules with no successors. The level /, of a module 1) is the 
summation of the execution time associated with the modules in a path from 7) to a terminal module such that this sum 
is maximal. Such a path is called the critical path if the module 7) is the highest level in the DATG [12], and we 
define the critical path length as 

D 4 max /, (2) 

F r, s v, 

where D cp is the minimum possible finishing time for the multiprocessors to process all the modules in the given 
DATG. The physical meaning of the critical path, whichever scheduling method is employed, is the finishing time 
over all permissible schedules and cannot be shorter than the D cp . 

2.2. Processor Finishing Time and Interprocessor Communication Time 

Two important parameters are usually considered in the mapping problem: the processor finishing time (PFT) 
and the interprocessor communication time (PCT). The processor finishing time of a processor k for a certain 
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mapping S, PFT k (S) y 1 £ k <* p, is the accumulated execution time of the modules assigned to that processor. If there 
exists some precedence constraint among these modules, then the PFT k {S) of processor k has to take the processor idle 
time into consideration. Determining the minimum processor finishing time with the precedence constraint taken into 
consideration is known as the minimum-finishing-time scheduling problem. The interprocessor communication time 
occurs only when two communicating modules are assigned to different processors such that the two modules residing 
on different processors can communicate through a communication channel or bus [15]. The message sent from one 
processor to other receiving processors causes the communication overhead among the sending and receiving proces- 
sors. This results in the PCT which requires processing load on both the sending and receiving processors. For exam- 
ple, a specific module is sending a message from processor i to processor j. The sending processor i needs to spend 
time on message formatting and addresses initialization to the destination. Meanwhile, the receiving processor j needs 
to spend time on extracting the message contents from the sending processor. The message extracting time is usually 
much longer than the sending time which results in heavy traffic in the communication channel and causes the block- 
ing of further messages arriving into the receiving processor j. Such PCT overhead can be eliminated if the two com- 
municating modules are assigned to the same processor because both modules would locate at the same local memory, 
and the message communication time within the same processor is negligible and can be ignored. 

2.3. Objective Function and Optimization Criterion 

For most multiprocessor systems, minimizing the maximum processor finishing time is the most important per- 
formance measure and has the effect of evenly distributing the modules to all the processors to achieve the shortest 
possible computation time. In a task graph with the precedence constraint imposed on the modules, satisfying this 
mapping performance is known as the minimum-finishing-time scheduling problem. However, this performance cri- 
terion causes a negative effect which results in heavy communication costs among the processors (i.e., PCT). Minim- 
izing the PCT alone may not produce a good assignment either because in a homogeneous system where all the pro- 
cessors are identical, a minimum PCT cost assignment will assign all the modules to a single processor, thus achieving 
zero PCT! The conflict of mapping performance between PFT and PCT has been studied [13]. 

The mapping problem is to find an optimal matching between a task graph and a processor graph, and this prob- 
lem can be considered as a partition of the given DATG into several blocks with the modules in each block assigned 
to be executed by a corresponding processor. Figure 2 illustrates a partition of a DATG into three blocks to be exe- 
cuted on a three-processor computer system. Efficiency of a mapping can be improved by balancing the task modules 
among all the processors and minimizing the weight of the edges across the boundaries of the blocks which 
correspond to the communication costs among the processors. 

Let S be a certain mapping and ft be the set of all the mappings. Let <(>*, 1 < k </>, be a partition of the given 

DATG such that (jty k = K and $ h <t>* = 0, h±k . The PCT incurred by processor k in the kth block, <)>*, is given 
*=1 
by 


PCT k (S)= £ etfij + eWi \ 
,/.**♦* 


(3) 


where modules i, j\ and k are related by I PR ED (j) = i and IPRED(i) - /, and 

? [ 0 , i(IPRED(j)*i 

e ‘ j ~ j 1 , if IPRED(J) = i . 


Let E k (S) y 1 < k < p, be the processor response time spent in processor k which consists of the sum of the pro- 
cessor finishing time PFT k (S) and the interprocessor communication time PCT k (S) y 

E k (S) = P FT k (S) + PCT k (S) , (5) 

where the PCT k (S) is defined as in Eq. (3), and the PFT k (S) is the accumulated execution time of the modules 
assigned to that processor. The PFT k (S) is not simply the summation of the modules’ execution time assigned to pro- 
cessor k\ instead, it must include the processor idle time of that processor due to the precedence constraint. Usually, 
this value is greater than the summation of the modules* execution time assigned to that processor. If the system is 
initialized, the PFT k (S ) is defined as the finishing time of executing the last module assigned to processor k. 

Let E(S) be the maximum processor response time. For a mapping S, Se ft, the minimum-response-time map- 
ping is the mapping S * that minimizes the E(S) y that is, 

E(S*) = min E(S) = min ma x(PFT k (S) + PCT k (S) ) . (6) 

Se Q Se Cl\ZkZp 

This means that we want to minimize the maximum processor response time, resulting in the minimax optimization 
criterion. 
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2.4. Task Graph of Newton-Euler Equations of Motion 

To achieve parallel processing with minimum response time, it is desirable to develop a directed task graph with 
maximum parallelism for the NE equations of motion. Unfortunately, a maximum -parallelism NE task graph may not 
yield a minimum-response-time mapping when the interprocessor communication time is taken into consideration. 
Thus, we perform a functional decomposition of the NE equations; that is, the equations are decomposed into compu- 
tational modules, each of which calculates the kinematic and dynamic variables such as angular velocities, angular 
and linear accelerations, joint forces and moments, etc. The recursive structure of NE formulation with respect to the 
link coordinate systems is found to be in an inhomogeneous linear recurrence form (IHLR) which is not efficient for 
parallel processing [3,11]. On the other hand, when expressed in the base coordinate system, the NE equations are in 
a homogeneous linear recurrence form (HLR) which is more suitable for parallel processing [3,11]. The NE equations 
of motion expressed as HLR form with the detailed task modules for our mapping problem are shown in Fig. 3. For a 
six-jointed, PUMA-like manipulator, this task graph shows that the NE equations can be decomposed into 145 task 
modules [24], 

Using the minimax optimization criterion as in Eq. (6), the mapping between the above NE task graph and a set 
of connected processors will be investigated. Our proposed mapping strategies for a multiprocessor system are sub- 
stantially different from previous work, and their contributions can be seen in the following: (1) the precedence con- 
straint and unequal module execution time of the task modules in the task graph are considered in the system; (2) 
unequal communication costs among processors in the processor graph are considered in the objective function; and 
(3) the minimax optimization criterion is used to obtain the minimum-response-time mapping. The proposed mapping 
algorithms are very general and can be applied to most multiprocessor systems. 

3. The Mapping Algorithm 

The mapping problem can be considered as decomposing a computational task into a set of task modules exe- 
cuted concurrently on a multiprocessor system. The design of this mapping is divided into two major steps: graph par- 
titioning and module allocation. Both of these two steps are known to be NP-complete. Thus, we propose two heuris- 
tic mapping algorithms to obtain fast mapping solutions for computing the NE equations of motion. Both heuristic 
algorithms take the PFT, the PCT, the precedence constraint of the NE task graph, and the multiprocessor system 
structure into consideration. 

3.1. Heuristic Mapping Algorithm with Weighted Bipartite Matching 

Based on the given NE task graph in Fig. 3, the heuristic algorithm constructs a dynamic priority list containing 
all the task modules arranged in a descending order according to the weighted level /, and the communication intensity 
of the task modules. Most of the priority lists developed in previous research do not include the communication costs 
[9,11]. To develop the dynamic priority list, let us denote A (n) to be a set of modules that have been assigned to the 
processors at the nth insertion ^tage (i.e., the modules that have been inserted into the mapping-schedule from the 
dynamic priority list), and let A(n) be the complement of A(n). Let P^,(n) be the processors) with the minimum 
finishing time at this stage, and let K (n) denote the set of modules assigned to the remaining processors which have 
not yet finished processing. Let FW ( A(n)) be the function that returns the set of modules, W(n), which are ready to 
be assigned to all the p processors, that is, for all modules 7] e A(n), if and only if PRED ( T t ) £A(n). Similarly, the 
function FW ( K ( n ) ^ jA(n )) - K (n) returns the set of modules, R (n), which are ready to be assigned to the 
These notations will be useful for developing the proposed heuristic algorithm. 

The mapping-schedule obtained from this heuristic algorithm starts from zero initially and gradually the ready 
modules are “inserted” into the mapping-schedule until all the modules have been inserted. Instead of randomly 
inserting the ready modules into the available processors P^{n), the insertion of ready modules needs to consider the 
increased communication costs they created. In order to minimize the maximum processor response time, this inser- 
tion of ready modules has to be carefully selected so that it does not increase the PCT while maintaining the 
minimum-finishing-time schedule. This insertion process (or module allocation process) can be done by a weighted 
Bipartite Matching Algorithm [20] which will be discussed later. Using the weighted bipartite matching algorithm at 
each stage of the insertion of ready modules into the mapping-schedule, the PCT will be maintained to be the 
minimum. 

Minimizing the maximum processor response time also requires us to consider the PFT at each stage of the 
insertion of ready modules. Due to the precedence constraint of the task graph, some idle modules might have to be 
assigned to the available processors during the insertion of the ready modules. Table 1 shows the PFT of the 
mapping-schedule in Fig. 1 when the new non-idle module is inserted to that processor. In order to include both the 
PFT and the PCT in the construction of the dynamic priority list to order the ready modules, we introduce two parame- 
ters, a and p, for weighting the level /, and the communication intensity of task modules, respectively, in adjusting the 
assignment priority coefficient p ( . The ready modules are ordered into a priority list according to the descending order 
of the assignment priority coefficient p„. Then the ready modules in the priority list can be allocated to the available 
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processors using the weighted bipartite matching algorithm. 

Algorithm HM ( Heuristic Mapping Algorithm). Given a DATG and an UDPG, this algorithm constructs a 
dynamic priority list of all the task modules and inserts the modules one by one into the mapping-schedule according 
to the weighted bipartite matching algorithm. 

BBH1. [Initialization.] Input the given DATG. Obtain the critical path and the total number of edges, nedge, in 
the DATG, where nedge = I E c I . 

H2, [Initialize Looping.] Set the mapping-schedule empty (i.e., A(n) = 0 ); initialize the processor finishing time 
PFT and the interprocessor communication time PCT. Create two parameters aandP(a + p= l), a from one 
to zero and P from zero to one with an increment/decrement ASTEP. Set n <— 1. 

H3. [Determine the level of modules in R («).] Determine the set of ready modules R(n), and find the level /, of each 
module T, in the set R (n). 

H4. [Obtain the successors and the predecessors number.] For each module 7] in the set R(n), find its number of 
successors, ns;, and its number of predecessors, np,-. 

H5. [Evaluate the assignment priority coefficient.] Evaluate the assignment priority coefficient, p„ of module 7] 
from the equation 

•" herca+|i ' 1 ' (7) 

H6. [Order the priority list.] For all the ready modules in the set R(n), construct an ordered priority list R,(n) 
according to the descending order of p ( . 

H7. [Assign the modules.] Determine the number of available processors p m =| F^(n)| . According to the 
weighted bipartite matching algorithm, assign the first p„ modules to the available processors based on the 
priority list R t (n). Record the modules in each processor, the PFT, and the PCT. 

H8. [End of mapping-schedule?] If A(n) * 0, then set n <- n + 1, and go to step H3; otherwise, continue. 

H9. [Obtain minimum cost among all mappings.] If A(n) = 0, record the mapping-schedule and the costs of this 
mapping. If a * 1, then go to step H2 to obtain another mapping; otherwise, stop and obtain the minimum cost 
among all the mappings. 

ENDHM. 

In the above HM algorithm, based on the level /, and the communication intensity of the task modules in the 
DATG, the weighting parameters, a and p, are used to evaluate the assignment priority coefficient p, which is used to 
order the priority list R t (n) (steps H5 and H6). The parameters a and P are used to weight the level and the communi- 
cation intensity of the task modules, respectively. In the extreme case, when a = 1, the ready modules are ordered 
according to their level in the task graph. Similarly, when P = 1, the ready modules are ordered according to the com- 
munication intensity of the task modules in the task graph. By suitably adjusting a and p, various orderings of the 
ready modules in the priority list at each stage of the insertion can be generated, thus providing a better mapping solu- 
tion. 

The weighted bipartite matching problem is also known as the “job- worker” assignment problem. Each 
worker must be assigned to exactly one job, and each job must have one assigned worker. If job i is assigned to 
worker j, then a benefit of the objective function is realized. It is desirable to make an assignment such that the total 
benefit of the objective function is optimized. In our mapping problem, the jobs and workers are, respectively, 
corresponding to the ready modules and the available processors at each stage of the insertion. Since there are p„ 
available processors at each stage of the insertion, the weights or benefits produced by job i (or ready module i) 
assigned to worker j (or available processor J) is not simply a scalar value. Instead, the weights are expressed in a 
p m -tuple, each of the kth element in this -tuple is the benefit obtained by the kth processor in this assignment. We 
want to minimize the maximum element of this p m -tuple to obtain the minimum cost of the assignment. Before we 
formulate the weighted bipartite matching algorithm, two operations are defined on the ordered p-tuples. Let 
D = (di ,d 2 , • • • , dp) be an ordered p-tuple and U = (u ] , u 2 , • • • , u p ) be an another ordered p-tuple. Then, the 
sum operation of these two ordered p- tuples, written as D © U, results in another p-tuple, W, 

W = (w x ,w 2 , ••• ,w p ) = DQU^(d l +u i ,d 2 +u 2 , ,d p +u p ). (8) 

The maximum value of this p-tuple, written as max p , is the maximum element in W, that is, max p 4 max w,. The sum- 
mation of a series of p-tuples is written as Y . 
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Algorithm WBM ( Weighted Bipartite Matching Algorithm). Given an ordered priority list of the ready 
modules /?,■(»), the available processors, the number of available processors, the mapping-schedule at the nth stage, 
the first p m ready modules denoted as R p “ (n), the matrix RM, the p m -tuple processor finishing time PFT p ~ (n — 1), and 
Pm "tuple interprocessor communication time P CT P ~ (n — 1), this algorithm assigns the ready modules to the avail- 
able processors such that the PCT among all the processors is minimum. 

Wl. [Obtain communication time vector.] For module y in obtain the number of its predecessors, 

located in processor k, 1 5 k <p„. Construct a p m x 1 vector; each entry of this vector is a -tuple. The *th 
element of the ith entry of this vector, k * i, is the PCT incurred on processor k if module y is assigned to pro- 
cessor i and is calculated as (J* = RM {i,k)ny±. The ith element of the ith entry of the communication time vector 

* s Z c t- This ^ due to the PCT incurred on processor i and is defined as the sum of the weights of the edges 

kmiJM 

across the boundary of the block i as we indicated in Eq. (3). 

W2. [Obtain communication time matrix.] For all the ready modules in R p ~(n), construct the p m *Pm communica- 
tion time matrix r(n) by collecting all the communication time vectors. Note that the tth column of this matrix 
corresponds to the communication time vector which is created by the module y, and this module is located at 
the ith position of the R, p ~ (n) priority list 

W3. [Obtain cost matrix.] Construct a p m xp„ cost matrix C{n) = [c iy (n)] by 

CiM) = I»© (0. • • • SFTf-in - 1), • • • ,0 )®PCT p -(n - 1)© (0, • • • ,D y , • • • ,0), (9) 

where *,7 = 1,2, ■ • • ,p m , PFT p ~{n - 1) is the ith element of the p m -tuple at the (n - l)th stage, D y is the 
execution time of module y and is located at the ith position of the p^-tuple. Note that c iy is a p„- -tuple. 

W4. [Weighted bipartite assignment problem.] Solve the module assignment problem by finding an assignment 
matrix X(n) = [x l7 (n)] t i , j = 1 , 2 , • • • , p m to 

max />- I^dj(n)xij(n) ) (10) 

i J 

P- P- 

subject to 5> iy (n) = 1 for i = 1,2, • • • ,p„, ^x tj (n) = 1 for j = 1,2, • • • , Pav , and xMn) = 1 or 0. 
y* l »*i 

W5* [Assign modules to processors.] According to the module assignment matrix X{n) y assign the modules in 
Ri (n) to the available processors. 

END WBM. 

When inserting the p„ ready modules into the mapping-schedule at the nth stage, the cost matrix C=[c, 7 (n)] in 
Eq. (9) includes the communication time the modules are going to create with the assigned modules in the A(n), the 
processor finishing time of the mapping-schedule, PFT p ~(n - 1), the interprocessor communication time of the 
mapping-schedule, PCT p ~(n — 1), and the execution time of the modules to guarantee the best assignment 

As an example, consider the mapping-schedule of the given DATG in Fig. 1 (m = 9). The task modules are to 
be executed by 3 processors (p = 3). The level number and execution time of each module are given beside each 
module in the DATG, the e r tJ and e‘j, i , j = 1,2,3, are shown in the UDPG. We use the HM algorithm and let a = 0, 

P = 1 to determine a heuristic mapping-schedule. The R(n), p m , R, p ~(n), PFT(n), and PCT(n ) associated with each 
stage of insertion are listed in Table 2. The PFT and PCT of this mapping are found to be PFT = ( 13, 15, 7 ) and 
= ( 6, 7, 7 ); the cost of this mapping is 22 units, and the Gantt chart of this suboptimal mapping is shown in Fig. 

3.2. Heuristic Mapping Algorithm with Simulated Annealing 

In order to improve the mapping solution obtained by the above HM algorithm, the simulated annealing method 
[21], which incorporates an iterative improvement scheme and the Metropolis procedure, is introduced to find a near- 
optimal mapping. Consider that one starts with an initial state (i.e., a partition of the task graph DATG) of the optimi- 
zation problem, instead of always rejecting the new state that increases the objective function, this new state with 
small conditional probability is accepted. In other words, if AE <, 0, then accept this new state; if A£ > 0, then accept 
this new state with probability e _A£/T , where AE = £(i w ) - E(S) is the amount of the increase in the maximum pro- 
cessor response time E ( S ), and S are, respectively, the new and old states, and T is a control parameter. Initially, 
one starts with a large value of T; after the system is approaching the equilibrium at this T value, then T is reduced to a 
lower value and the system will approach to another equilibrium. This procedure is stopped at a certain desirable T, or 
no more improvement can be expected. By conditionally accepting the increased £ (5^) and by varying the control 
parameter T, one can escape the pitfall of the local optimal and obtain a better mapping. 
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Applying this simulated annealing method to our mapping problem, a random state is generated when all the 
modules in the DATG are randomly assigned to the processors and the PCT of this assignment is obtained (this is the 
graph partitioning problem); then based on the modules assigned to these processors, we schedule the computation o 
these modules in the processors according to the precedence constraint imix>sed on them and the PFT of this assign- 
ment is obtained (this is the scheduling problem). To compute the objective function of a new state, one needs to 
compute both the PCT and the PFT. The PCT of a state is simply the weights of all the edges across the boundaries of 
each block while the PFT of this state is difficult to obtain because obtaining the optimal schedule is difficult. Thus, 
one may use the heuristic scheduling algorithm [11] to obtain a suboptimal schedule or derive the lower bound of the 
PFT of the schedule. Since the graph partitioning randomly assigns task modules to be executed on specific proces- 
sors, the scheduling problem that follows needs to address the restriction of executing these modules on their specific 
processors. This scheduling problem can be solved by the proposed processor-restricted dynamical-highest-levcl- 
first/most-immediate-successors-first algorithm (Algorithm PRDHLF IMISF). 

Algorithm PRDHLF/MISF. ( Processor-Restricted Dynamical Highest Level First/Most Immediate Succes- 
sors First Algorithm). Given a task graph and restricting the execution of the modules on specific processors, this 
algorithm constructs a dynamic priority list of all the modules and inserts the modules one by one into the suboptimal 
schedule. 

Dl. [Initialization.] Initially, the schedule is empty (i.e. A(n) = 0 ). Let P, be the set of modules that are pre- 
assigned to the processor i , i = 1 , 2 , • • • ,p. Let P l „f t {n), i = 1,2, • • • ,p, be the processor i having the 
minimum-finishing-time at the nth stage of the insertion. 

D2. [Determine the set R (n) and the sets R,(n).] Determine the set R (n) and obtain R,(n)=R (n)p/ 5 ,, »=1,2, • • • ,p, 
where /?,(«) is the the set of ready modules assigned to the processor i. 

D3. [Determine the levels of modules in /?,(«).] Find the level /*, /*c /?,(«), for each module in the set R,{n), 
i=l,2, • • •,/?. 

D4. [Construct the priority lists.] Construct the ith dynamic priority list of processor i in a descending order of l k . If 
the levels of the modules are tied, then the module having the largest number of immediately successive 
modules is assigned to the highest priority. 

D5. [Assign the modules.] Assign the modules to the P^,(n) on the basis of the ith priority list, i=l,2, • ■ •,/>. If 
A(n) = 0, then stop; otherwise, go to step D2. 

END PRDHLF/MISF. 

Although the PRDHLF IMISF algorithm provides a fast but suboptimal schedule, it is still very time-consuming 
to obtain the schedule when the number of task modules is large. Since our objective is to compute the PFT instead of 
finding a complete schedule, it is more desirable to find the lower bound of the PFT of the schedule. The lower bound 
of the PFT of the schedule without the restriction of executing the modules on specific processors has been discussed 
[22], The lower bound of the PFT of the schedule with the restriction of executing the modules on specific processors 
is given by 

PF1^{S) = max [x? (Tj) + q t ] (H) 

T, 6 P, 

where the notation and the proof of this lower bound can be found in [24], 

With the lower bound of the PFT in Eq. (11) and the PCT obtained from the graph partitioning, the objective 
function of a state S can be easily obtained by 

E(S) = max(PFT k B (S) + PCT k (S)). (12) 

lfiiSp 

Using the objective function in Eq. (12), the heuristic mapping algorithm with simulated annealing to obtain a near- 
optimal mapping is summarized in the following. 

Algorithm HMSA ( Heuristic Mapping Algorithm with Simulated Annealing) Given a DATG, an UDPG, the 
matrices SM and RM, and the control parameter T, this algorithm generates a near-optimal solution of the mapping 
problem. 

51. [Initialization.] Select an initial state S, and evaluate the objective function E(S) in Eq. (12). 

52. [Looping with T .] If the equilibrium with respect to the control parameter T is reached, go to step S6; otherwise 
continue. 

53. [Generate new state.] Generate a new slate S MW , and calculate £(£«*,) of this new state according to Eq. (12). 
Obtain A£ = E(S m ) - E(S). Select a random variable y, 0 < y< 1. If AE < 0, go to step S5; otherwise go to 
step S4. 
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and go to step S5; otherwise reject 


54. [AE >0.] If y < e &E>T , then accept this new state with probability e _A£/r 
this new state and go to step S3. 

55. [AE < 0 or y < e _A£ ' ,T .J Set S <- S vw . Record EiS^) and go to step S2. 

56. [Looping with a smaller value of T.\ If T > stopping criterion, set T «- 7- AT, where AT is a user-designed 
variable, and go to step S3; otherwise, continue. 

57. [Obtain the mapping.] Obtain the state which has the smallest E(S). Based on this state, schedule the modules 
according to the precedence constraint of the DATG and obtain the mapping. 

END HMSA. 

The HMSA algorithm is used to find a near-optimal mapping for the example in Fig. 1. An initial partition of 
the task graph assigns the modules [7] , T 2 , 7 3 ], [7 7 , 7 8 , 7 9 ), and (7 4 , 7 S , T 6 ) to processors 1, 2, and 3 for exe- 
cution, respectively. At this initial state, the PFTf* for i = 1 , 2 , 3 are respectively 10, 21, 15, and the PCT, are 5, 12, 
and 21. The total cost of this state is 36 units. In this example, 273 iterations were required to reach the near-optimal 
mapping. The PFTf for the final state are 12, 15, and 13, and the PCT, are 8, 7, and 9. The total cost of this final 
state is 22 units, and the partition of the task graph for this mapping assigns the modules [7 3 , 7 6 , 7 7 } , [7 5 , 7, } , and 
(7 1 ,7 2 ,7 4 ,7g} to processors 1, 2, and 3 for execution, respectively. It is interesting to note that both the HM and 
HMSA algorithms reached the same total cost of 22 units with different mapping solutions. Further computer simula- 
tions and actual implementation of the proposed algorithms on an NCUBE/ten multiprocessor system are described in 
the next section. 


4. Computer Simulations and Experimental Results 

Since the optimal mapping solution is AT’-complete, the design of computer simulations for evaluating the 
efficiency of the proposed HM and HMSA algorithms as compared with the optimal solution is constrained by the 
number of task modules and processors that our computer can process within a reasonable time. A total of 100 ran- 
dom DATGs, each with the number of modules ranging from 5 to 10, was generated for mapping onto multiprocessor 
systems with 2 to 3 processors. The ratio of average module execution time of the modules in the DATG and the 
average interprocessor communication time between paired processors, P 1C, was introduced to determine the com- 
munication overhead affecting the performance of the algorithms. In our computer simulations, all the optimal solu- 
tions were obtained by the exhaustive search method. Table 3 shows the efficiency of the HM algorithm with a PIC 
ratio ranging from 10 to 0.1 together with different numbers of processors. To evaluate the efficiency of the heuristic 
algorithms, an average relative error e over N mappings is defined as 


e = 


1 ” hi-o' t 

N h o\ 


(13) 


where /t, is the ith heuristic solution and o] is the ith optimal solution. 

From the computer simulation, when the PIC ratio is less than 1, the solutions obtained by the HM algorithm 
were found to be unsatisfactory. This is due to the fact that the HM algorithm always distributes the ready modules to 
the available processors to achieve the minimum communication costs. When the average communication time is 
much greater than the average module execution time or the communication link density is higher, evenly distributing 
the ready modules to all the available processors will not produce a better mapping. Instead, it may happen that all the 
ready modules may be assigned to the same processor to reduce the large average communication time and sacrifice 
relatively small average module execution time to yield a better response. Thus, when the PIC ratio is less than one, a 
sequential computation of all the task modules on a single processor will yield a better result than parallel processing. 
Hence the performance of a multiprocessor system is closely related to the P 1C ratio. To improve the performance of 
a multiprocessor system, it is suggested that the computational task should be decomposed such that the PIC ratio is 
greater than one. In our computer simulation, when the P 1C ratio is greater than one, all the solutions obtained by the 
HM algorithm approached to the near-optimal solutions. The mapping solutions obtained from the HMSA algorithm 
for all the 100 DATGs agreed with the optimal solutions obtained by the exhaustive search method. 

To further evaluate and validate the effectiveness of our proposed mapping algorithms, the computation of the 
robot inverse dynamics based on the NE task graph in Fig. 3 is implemented on an NCUBE/ten multiprocessor system. 
Once the sending and receiving processors are given, the path is obtained by the operating system of the machine. 
The performance of communication activities between two processors in the hypercube computer has been addressed 
[23]. It is shown that the hypercube machine has a high communication cost compared with the processing time of 
computing elementary multiplication or addition operation. For transferring a 12-byte (or 96-bit)t message, the ratio 


t For a functional decomposition, each entity of Lhe computation is 3 x 1 vector (e.g., etc.) and each scale value requires a 4-byte (32-bit) 
computation; thus the standard size to transfer a vector is 12 bytes. 
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of multiplication (or addition) operation and the message passing between two adjacent processors is approximately 
PIC = 50/510 c 1. This indicates that for implementation on NCUBE machines, it is not advisable to decompose the 
NE equations of motion into elementary operations. Thus, we perform functional decomposition on the NE equations 
of motion which results in the task graph in Fig. 3. The communication time between two processors for our 
NCUBE/ten machine has been measured experimentally and is found to be 0.34 milliseconds for sending a 12-byte 
message from a source processor to any destination processor. For receiving the message, we found that it takes 1.02, 
1.7, and 2.2 milliseconds to receive a 12-byte message between one, two, and three hops$, respectively. 

A recent result indicates that it takes 24.8 milliseconds to compute the robot inverse dynamics on a uniprocessor 
system [8]. Due to the high communication time, if we implement the robot inverse dynamics computation on the 
hypcrcube computer, it may not generate a faster result than that running on a uniprocessor computer. In order to 
avoid mapping all the modules to a single processor by the proposed heuristic algorithms, we artificially set the execu- 
tion time of a 12-byte elementary operation (multiplication or addition) to 3.4 milliseconds which is ten times of the 
time for sending a 12-byte message. Based on the NE task graph in Fig. 3, for a 6-jointed PUMA-like manipulator, 
the NE equations can be decomposed into 145 task modules. The detailed computation time and precedence relation- 
ship of each module can be found in [24]. Using the sending time matrix SM and the receiving time matrix RM of a 
subcube of four processors of our NCUBE/ten computer, we simulated on a VAX-1 1/780 computer the HM algorithm 
for finding our mapping solution. Based on the 145-module NE task graph, a suboptimal mapping was obtained from 
this computer simulation. The minimum processor response time of each processor from the suboptimal mapping was 
found to be 704.82, 701.42, 706.52, and 703.12 milliseconds for processor 0, 1, 2, and 3, respectively. We then imple- 
mented the mapping solution from the computer simulation into the actual NCUBE/ten machine. The minimum pro- 
cessor response time of each processor from the mapping was found to be 679.71, 676.69, 675.78, and 669.23 mil- 
liseconds for processor 0, 1,2, and 3, respectively. Table 4 compares the computer simulation result with the actual 
implementation result and shows a maximum of five percent relative error between them. 

In order to achieve the critical-path-length computation of the 145-module NE task graph, the hypercube dimen- 
sion was increased from N = 2 to N = 3; that is, 8 processors were used for the computations. The details of the allo- 
cation of the 145 modules in each of the eight processors can be found in [24]. Using p = 8 processors, our HM algo- 
rithm achieves the critical-path-length computation of 400.18 milliseconds which incurred in processor number 3; 
with a communication time of 27.86 milliseconds, it gave a shortest possible response time of 428.04 milliseconds. 
The critical-path-length computation of 400.18 milliseconds means that the use of more than 8 processors for parallel 
processing will never obtain a shorter processing time. The comparison of the computer simulation result and the 
implementation result is shown in Table 5. 

5. Conclusions 

The problem of determining the optimal mapping between a computational task consisting of m modules with 
precedence constraint and a set of connected p identical processors with interprocessor communication time is formu- 
lated as an equivalent problem of solving the graph partitioning and the scheduling problems. Both of these problems 
are known to be AT^-complete. Two efficient mapping algorithms, the HM algorithm and the HMSA algorithm, were 
proposed to determine fast and suboptimal or near-optimal mapping solutions. Minimizing the maximum of the sum 
of the processor finishing time and the interprocessor communication time is used as an optimization criterion for 
determining the best mapping. The proposed HM algorithm and the HMSA algorithm greatly reduce the time com- 
plexity of determining the mapping. Computer simulation results indicated that the proposed mapping algorithms are 
efficient and practical and that they can provide suboptimal as well as near-optimal solutions. Computer simulation 
was also conducted to simulate the operation of an NCUBE/ten hypercube computer for determining the mapping of 
the robot inverse dynamics computation of a 6-jointed PUMA-like manipulator. Actual implementation of the HM 
algorithm on the NCUBE/ten hypercube computer was also performed. 
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Figure L (a) A direct task graph. 

(b) An undirected processor graph with SM and AAf matrices. 
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Figure 4. The Gantt chart for the Fig. 1 example. 


Figure 2. Partitioning a task graph into three blocks. 
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Figure 3. Task graph of Newton- Euler equations of motion. 
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Table 1. The PFT of the mapping-schedule in Fig. 1 example. 


Insertion 

Stage 

PFT x {n) 

PFT 2 (n) 

PFT,(n) 

1 

2 

2 

2 

2 


6 

6 

3 

7 

7 

7 

4 

10 

10 




11 

ii 



13 

13 



17 


8 


20 


9 


22 



Table 2. Cost value of each insertion stage for the Fig. 1 example. 
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Table 3. Simulation results of the HM algorithm with different P !C ratios. 


P/C 

Optimality 
Percentage (%) 

Relative 
Error (%) 

Number of 

Number of 

Number of 

Number of 


Processors = 2 

Processors = 3 

Processors = 2 

Processors = 3 


71.43 

90.48 

3.05 

0.44 

5 

71.43 

90.48 

3.67 

0.29 

2 

71.43 

95.24 

5.61 

0.05 

1 

52.38 

66.67 

13.47 

3.09 

MEM 

30.77 

15.38 

17.34 

16.65 

■S3 

0 

0 

102.02 

92.26 

1 o.i 

0 

0 

224.32 

254.36 


Table 5. Comparison between simulation and implementation 
results for the eight-prpcessor case. 


Processor 

Number 

Simulation 

Result 

Hypercube 

Result 

Relative 
Error (%) 

0 

379.70 ms 

361.53 ms 

4.79 

1 

330.08 ms 

309.85 ms 

6.13 

2 

383.12 ms 

359.46 ms 

6.17 

3 

428.04 ms 

415.10 m 

3.02 

4 

406.98 ms 

400.10 m 

1.70 

5 

376.50 ms 

360.42 ms 

4.27 

6 

376.16 ms 

354.28 ms 

5.82 

7 

374.50 ms 

355.36 ms 

5.11 
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